Feature Dynamic Bayesian Networks 



Marcus Hutter 

RSISE @ ANU and SML @ NICTA 
Canberra, ACT, 0200, Australia 
marcus@hutter1.net www.hutter1.net






24 December 2008 



Abstract 



Feature Markov Decision Processes (ΦMDPs) [Hut09] are well-suited for learning agents in general environments. Nevertheless, unstructured (Φ)MDPs are limited to relatively simple environments. Structured MDPs like Dynamic Bayesian Networks (DBNs) are used for large-scale real-world problems. In this article I extend ΦMDP to ΦDBN. The primary contribution is to derive a cost criterion that allows one to automatically extract the most relevant features from the environment, leading to the "best" DBN representation. I discuss all building blocks required for a complete general learning algorithm.

Keywords: Reinforcement learning; dynamic 
Bayesian network; structure learning; feature 
learning; global vs. local reward; explore-exploit. 

1 Introduction 

Agents. The agent-environment setup in which an Agent interacts with an Environment is a very general and prevalent framework for studying intelligent learning systems [RN03]. In cycles $t = 1,2,3,...$, the environment provides a (regular) observation $o_t \in \mathcal{O}$ (e.g. a camera image) to the agent; then the agent chooses an action $a_t \in \mathcal{A}$ (e.g. a limb movement); finally the environment provides a real-valued reward $r_t \in \mathbb{R}$ to the agent. The reward may be very scarce, e.g. just +1 (-1) for winning (losing) a chess game, and 0 at all other times [Hut05, Sec. 6.3]. Then the next cycle $t+1$ starts. The agent's objective is to maximize his reward.

Environments. For example, sequence prediction is concerned with environments that do not react to the agent's actions (e.g. a weather-forecasting "action") [Hut03], planning deals with the case where the environmental function is known [RPPCd08], classification and regression is for conditionally independent observations [Bis06], Markov Decision Processes (MDPs) assume that $o_t$ and $r_t$ only depend on $a_{t-1}$ and $o_{t-1}$ [SB98], POMDPs deal with Partially Observable MDPs [KLC98], and Dynamic Bayesian Networks (DBNs) with structured MDPs [BDH99].



Feature MDPs [Hut09]. Concrete real-world problems can often be modeled as MDPs. For this purpose, a designer extracts relevant features from the history (e.g. position and velocity of all objects), i.e. the history $h_t = a_1 o_1 r_1 ... a_{t-1} o_{t-1} r_{t-1} o_t$ is summarized by a feature vector $s_t := \Phi(h_t)$. The feature vectors are regarded as states of an MDP and are assumed to be (approximately) Markov.

Artificial General Intelligence (AGI) [GP07] is concerned with designing agents that perform well in a very large range of environments [LH07], including all of the ones mentioned above and more. In this general situation, it is not a priori clear what the useful features are. Indeed, any observation in the (far) past may be relevant in the future. A solution suggested in [Hut09] is to learn Φ itself.

If Φ keeps too much of the history (e.g. $\Phi(h_t) = h_t$), the resulting MDP is too large (infinite) and cannot be learned. If Φ keeps too little, the resulting state sequence is not Markov. The Cost criterion I develop formalizes this tradeoff and is minimized for the "best" Φ. At any time $n$, the best Φ is the one that minimizes the Markov code length of $s_1...s_n$ and $r_1...r_n$. This is reminiscent of, but actually quite different from, MDL, which minimizes model+data code length [Grü07].

Dynamic Bayesian networks. The use of "unstructured" MDPs [Hut09], even our Φ-optimal ones, is clearly limited to relatively simple tasks. Real-world problems are structured and can often be represented by dynamic Bayesian networks (DBNs) with a reasonable number of nodes [DK89]. Bayesian networks in general and DBNs in particular are powerful tools for modeling and solving complex real-world problems. Advances in theory and increases in computation power constantly broaden their range of applicability [BDH99, SDL07].

Main contribution. The primary contribution of this work is to extend the Φ selection principle developed in [Hut09] for MDPs to the conceptually much more demanding DBN case. The major extra complications are approximating, learning and coding the rewards, the dependence of the Cost criterion on the DBN structure, learning the DBN structure, and how to store and find the optimal value function and policy.

Although this article is self-contained, it is recommended to read [Hut09] first.

2 Feature Dynamic Bayesian Networks (ΦDBN)

In this section I recapitulate the definition of ΦMDP from [Hut09], and adapt it to DBNs. While formally a DBN is just a special case of an MDP, exploiting the additional structure efficiently is a challenge. For generic MDPs, typical algorithms should be polynomial and can at best be linear in the number of states $|\mathcal{S}|$. For DBNs we want algorithms that are polynomial in the number of features $m$. Such DBNs have exponentially many states ($2^{O(m)}$), hence the standard MDP algorithms are exponential, not polynomial, in $m$. Deriving poly-time (and poly-space!) algorithms for DBNs by exploiting the additional DBN structure is the challenge. The gain is that we can handle exponentially large structured MDPs efficiently.

Notation. Throughout this article, log denotes the binary logarithm, and $\delta_{x,y} = \delta_{xy} = 1$ if $x = y$ and 0 else is the Kronecker symbol. I generally omit separating commas if no confusion arises, in particular in indices. For any $z$ of suitable type (string, vector, set), I define string $z = z_{1:l} = z_1...z_l$, sum $z_+ = \sum_j z_j$, union $z_* = \bigcup_j z_j$, and vector $z_\bullet = (z_1,...,z_l)$, where $j$ ranges over the full range $\{1,...,l\}$ and $l = |z|$ is the length or dimension or size of $z$. $\hat z$ denotes an estimate of $z$. The characteristic function $1_B = 1$ if $B$=true and 0 else. $P(\cdot)$ denotes a probability over states and rewards or parts thereof. I do not distinguish between random variables $Z$ and realizations $z$, and abbreviation $P(z) := P[Z = z]$ never leads to confusion. More specifically, $m \in \mathbb{N}$ denotes the number of features, $i \in \{1,...,m\}$ any feature, $n \in \mathbb{N}$ the current time, and $t \in \{1,...,n\}$ any time. Further, in order not to get distracted, at several places I gloss over initial conditions or special cases where inessential. Also 0 * undefined = 0 * infinity := 0.

ΦMDP definition. A ΦMDP consists of a 7-tuple $(\mathcal{O},\mathcal{A},\mathcal{R},\text{Agent},\text{Env},\Phi,\mathcal{S})$ = (observation space, action space, reward space, agent, environment, feature map, state space). Without much loss of generality, I assume that $\mathcal{A}$ and $\mathcal{O}$ are finite and $\mathcal{R} \subseteq \mathbb{R}$. Implicitly I assume $\mathcal{A}$ to be small, while $\mathcal{O}$ may be huge.

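Agent and Env are a pair or triple of interlocking functions of the history $\mathcal{H} := (\mathcal{O}\times\mathcal{A}\times\mathcal{R})^* \times \mathcal{O}$:

$$\text{Env} : \mathcal{H}\times\mathcal{A}\times\mathcal{R} \leadsto \mathcal{O}, \quad o_n = \text{Env}(h_{n-1}a_{n-1}r_{n-1}),$$
$$\text{Agent} : \mathcal{H} \leadsto \mathcal{A}, \quad a_n = \text{Agent}(h_n),$$
$$\text{Env} : \mathcal{H}\times\mathcal{A} \leadsto \mathcal{R}, \quad r_n = \text{Env}(h_n a_n),$$

where $\leadsto$ indicates that mappings $\to$ might be stochastic. The informal goal of AI is to design an Agent() that achieves high (expected) reward over the agent's lifetime in a large range of Env()ironments.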

The feature map Φ maps histories to states

$$\Phi : \mathcal{H} \to \mathcal{S}, \quad s_t = \Phi(h_t), \quad h_t = oar_{1:t-1}o_t \in \mathcal{H}$$

The idea is that Φ shall extract the "relevant" aspects of the history in the sense that the "compressed" history $sar_{1:n} = s_1a_1r_1...s_na_nr_n$ can well be described as a sample from some MDP $(\mathcal{S},\mathcal{A},T,R)$ = (state space, action space, transition probability, reward function).

(Φ) Dynamic Bayesian Networks are structured (Φ)MDPs. The state space is $\mathcal{S} = \{0,1\}^m$, and each state $s \equiv x \equiv (x^1,...,x^m) \in \mathcal{S}$ is interpreted as a feature vector $x = \Phi(h)$, where $x^i = \Phi^i(h)$ is the value of the $i$th binary feature. In the following I will also refer to $x^i$ as feature $i$, although strictly speaking it is its value. Since non-binary features can be realized as a list of binary features, I restrict myself to the latter.

Given $x_{t-1} = x$, I assume that the features $(x'^1,...,x'^m) = x'$ at time $t$ are independent, and that each $x'^i$ depends only on a subset of "parent" features $u^i \subseteq \{x^1,...,x^m\}$, i.e. the transition matrix has the structure

$$T^a_{xx'} = P(x_t = x' \mid x_{t-1} = x, a_{t-1} = a) = \prod_{i=1}^m P^a(x'^i \mid u^i) \qquad (1)$$

This defines our ΦDBN model. It is just a ΦMDP with special $\mathcal{S}$ and $T$. Explaining ΦDBN on an example is easier than staying general.
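
The factored transition probability can be illustrated with a minimal Python sketch. The data layout (a list `parents` of parent index tuples and per-feature tables `cpt` holding the probability that the feature is 1) and the toy numbers are assumptions made for illustration only; they do not come from the article.

```python
from itertools import product

def transition_prob(x, a, x_next, parents, cpt):
    """P(x'|x,a) as a product over features of P^a(x'^i | u^i)."""
    p = 1.0
    for i, pa in enumerate(parents):
        u = tuple(x[j] for j in pa)      # parent values u^i read off the old state x
        p_on = cpt[i][(a, u)]            # hypothetical table entry: P^a(x'^i = 1 | u^i)
        p *= p_on if x_next[i] == 1 else 1.0 - p_on
    return p

# Hypothetical 3-feature DBN with a single action "a0"
parents = [(0,), (0, 1), (2,)]
cpt = [
    {("a0", (0,)): 0.1, ("a0", (1,)): 0.9},
    {("a0", u): 0.5 for u in product((0, 1), repeat=2)},
    {("a0", (0,)): 0.2, ("a0", (1,)): 0.7},
]
x = (1, 0, 1)
total = sum(transition_prob(x, "a0", xn, parents, cpt)
            for xn in product((0, 1), repeat=3))
```

Storing one small table per feature needs space exponential only in the number of parents, not in $m$, which is the point of the factorization.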

3 ΦDBN Example

Consider an instantiation of the simple vacuum world [RN03, Sec. 3.6]. There are two rooms, A and B, and a vacuum Robot that can observe whether the room he is in is Clean or Dirty; Move to the other room; Suck, i.e. clean the room he is in; or do Nothing. After 3 days a room gets dirty again. Every clean room gives a reward 1, but a moving or sucking robot costs and hence reduces the reward by 1. Hence $\mathcal{O} = \{A,B\}\times\{C,D\}$, $\mathcal{A} = \{N,S,M\}$, $\mathcal{R} = \{-1,0,1,2\}$, and the dynamics Env() (possible histories) is clear from the above description.

Dynamics as a DBN. We can model the dynamics by a DBN as follows: The state is modeled by 3 features. Feature $R \in \{A,B\}$ stores in which room the robot is, and feature $A/B \in \{0,1,2,3\}$ remembers (capped at 3) how long ago the robot has cleaned room A/B last time, hence $\mathcal{S} = \{0,1,2,3\} \times \{A,B\} \times \{0,1,2,3\}$. The state/feature transition is as follows:

if ($x^R = A$ and $a = S$) then $x'^A = 0$ else $x'^A = \min\{x^A + 1, 3\}$;
if ($x^R = B$ and $a = S$) then $x'^B = 0$ else $x'^B = \min\{x^B + 1, 3\}$;
if $a = M$ (if $x^R = B$ then $x'^R = A$ else $x'^R = B$) else $x'^R = x^R$;
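
The transition rules above translate directly into code. The following Python sketch is an illustration only: the function name `step` and the state encoding `(xA, xR, xB)` are my choices, and the reward line assumes that a room counts as clean when its counter after the step is below 3, with a cost of 1 for any action other than N, as in the verbal description.

```python
def step(state, action):
    """One deterministic vacuum-world step; state = (xA, xR, xB), action in {"N", "S", "M"}."""
    xA, xR, xB = state
    # Sucking resets the counter of the room the robot is in; otherwise rooms age (capped at 3).
    nA = 0 if (xR == "A" and action == "S") else min(xA + 1, 3)
    nB = 0 if (xR == "B" and action == "S") else min(xB + 1, 3)
    # Moving switches the room; otherwise the robot stays.
    if action == "M":
        nR = "A" if xR == "B" else "B"
    else:
        nR = xR
    # Assumed reward: +1 per clean room after the step, -1 if the robot moved or sucked.
    reward = int(nA < 3) + int(nB < 3) - int(action != "N")
    return (nA, nR, nB), reward
```

For example, sucking in room A while both rooms are dirty, `step((3, "A", 3), "S")`, yields the state `(0, "A", 3)` and reward 0 under these assumptions.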



A DBN can be viewed as a two-layer Bayesian network [BDH99]. The dependency structure of our example is depicted in the right diagram.

Each feature consists of a (left,right)-pair of nodes, and a node $i \in \{1,2,3 = m\} \hat= \{A,R,B\}$ on the right is connected to all and only the parent features $u^i$ on the left. The reward is $r = 1_{x'^A<3} + 1_{x'^B<3} - 1_{a \neq N}$.

The feature map $\Phi = (\Phi^A,\Phi^R,\Phi^B)$ can also be written down explicitly. It depends on the actions and observations of the last 3 time steps.

Discussion. Note that all nodes $x'^i$ can implicitly also depend on the chosen action $a$. The optimal policies are repetitions of action sequence S,N,M or S,M,N. One might think that binary features $x^{A/B} \in \{C,D\}$ are sufficient, but this would result in a POMDP (Partially Observable MDP), since the cleanness of room A is not observed while the robot is in room B. That is, $x'$ would not be a (probabilistic) function of $x$ and $a$ alone. The quaternary feature $x^A \in \{0,1,2,3\}$ can easily be converted into two binary features, and similarly $x^B$. The purely deterministic example can easily be made stochastic. For instance, Sucking and Moving may fail with a certain probability. Possible, but more complicated, is to model a probabilistic transition from Clean to Dirty. In the randomized versions the agent needs to use its observations.

4 ΦDBN Coding and Evaluation

I now construct a code for $s_{1:n}$ given $a_{1:n}$, and for $r_{1:n}$ given $s_{1:n}$ and $a_{1:n}$, which is optimal (minimal) if $s_{1:n}r_{1:n}$ given $a_{1:n}$ is sampled from some MDP. It constitutes our cost function for Φ and is used to define the Φ selection principle for DBNs. Compared to the MDP case, reward coding is more complex, and there is an extra dependence on the graphical structure of the DBN.

Recall [Hut09] that a sequence $z_{1:n}$ with counts $\boldsymbol{n} = (n_1,...,n_m)$ can within an additive constant be coded in

$$CL(\boldsymbol{n}) := n\,H(\boldsymbol{n}/n) + \frac{m'-1}{2}\log n \ \text{ if } n > 0 \text{ and } 0 \text{ else} \qquad (2)$$

bits, where $n = n_+ = n_1 + ... + n_m$ and $m' = |\{i : n_i > 0\}| \le m$ is the number of non-empty categories, and $H(\boldsymbol{p}) := -\sum_{i=1}^m p_i \log p_i$ is the entropy of probability distribution $\boldsymbol{p}$. The code is optimal (within +O(1)) for all i.i.d. sources.
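
As a small worked illustration of (2), the code length can be computed from a count vector as follows. This is a sketch; the function name `code_length` is my own, and the additive constant is ignored as in the text.

```python
from math import log2

def code_length(counts):
    """CL(n) = n*H(n/n_+) + (m'-1)/2 * log2(n) for n > 0, and 0 otherwise (in bits)."""
    n = sum(counts)
    if n == 0:
        return 0.0
    nonzero = [c for c in counts if c > 0]            # the m' non-empty categories
    entropy = -sum(c / n * log2(c / n) for c in nonzero)
    return n * entropy + (len(nonzero) - 1) / 2 * log2(n)
```

A sequence with counts (4,4) costs 8·1 + ½·log 8 = 9.5 bits, whereas a constant sequence with counts (8,0) costs nothing beyond the ignored constant.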

State/Feature Coding. Similarly to the ΦMDP case, we need to code the temporal "observed" state=feature sequence $x_{1:n}$. I do this by a frequency estimate of the state/feature transition probability. (Within an additive constant, MDL, MML, combinatorial, incremental, and Bayesian coding all lead to the same result.) In the following I will drop the prime in $(u^i,a,x'^i)$ tuples and related situations if/since it does not lead to confusion. Let $T^{ia}_{u^i x^i} = \{t \le n : u^i_{t-1} = u^i, a_{t-1} = a, x^i_t = x^i\}$ be the set of times $t-1$ at which features that influence $x^i$ have values $u^i$, and action is $a$, and which leads to feature $i$ having value $x^i$. Let $n^{ia}_{u^i x^i} = |T^{ia}_{u^i x^i}|$ be their number ($n^{i+}_{++} = n$ $\forall i$). I estimate each feature probability separately by $\hat P^a(x^i|u^i) = n^{ia}_{u^i x^i} / n^{ia}_{u^i +}$. Using (1), this yields

$$\hat P(x_{1:n}|a_{1:n}) = \prod_{t=1}^n \hat T^{a_{t-1}}_{x_{t-1}x_t} = \prod_{t=1}^n \prod_{i=1}^m \hat P^{a_{t-1}}(x^i_t|u^i_{t-1}) = ... = \exp\Big[-\sum_{i,u^i,a} n^{ia}_{u^i+} H\Big(\frac{n^{ia}_{u^i\bullet}}{n^{ia}_{u^i+}}\Big)\Big]$$

The length of the Shannon-Fano code of $x_{1:n}$ is just the negative logarithm of this expression. We also need to code each non-zero count $n^{ia}_{u^i x^i}$ to accuracy $O(1/\sqrt{n^{ia}_{u^i+}})$, which each needs $\frac12\log(n^{ia}_{u^i+})$ bits. Together this gives a complete code of length

$$CL(x_{1:n}|a_{1:n}) = \sum_{i,u^i,a} CL(n^{ia}_{u^i\bullet}) \qquad (3)$$
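
The state code length (3) can be sketched in a few lines of Python by collecting the counts per (feature, action, parent values) context and summing their code lengths. The names `cl_states` and `code_length`, the 0-based time indexing (`actions[t-1]` leads from `xs[t-1]` to `xs[t]`), and the toy sequence are assumptions for illustration.

```python
from collections import defaultdict
from math import log2

def code_length(counts):
    """CL(n) from Eq. (2), in bits, up to an additive constant."""
    n = sum(counts)
    if n == 0:
        return 0.0
    nonzero = [c for c in counts if c > 0]
    entropy = -sum(c / n * log2(c / n) for c in nonzero)
    return n * entropy + (len(nonzero) - 1) / 2 * log2(n)

def cl_states(xs, actions, parents):
    """CL(x_{1:n}|a_{1:n}): sum of CL over all contexts (i, a, u^i) of binary features."""
    counts = defaultdict(lambda: [0, 0])              # context -> counts of x^i = 0 and x^i = 1
    for t in range(1, len(xs)):
        for i, pa in enumerate(parents):
            u = tuple(xs[t - 1][j] for j in pa)       # parent values at the previous step
            counts[(i, actions[t - 1], u)][xs[t][i]] += 1
    return sum(code_length(c) for c in counts.values())

# Toy example: a single feature that flips at every step
xs = [(t % 2,) for t in range(9)]
actions = ["N"] * 8
```

With the feature as its own parent the toy sequence is deterministic and costs 0 bits; with no parents the counts are (4,4) and the cost is 9.5 bits, so the criterion prefers the structure that explains the data.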



The rewards are more complicated. 

Reward structure. Let $R^a_{xx'}$ be (a model of) the observed reward when action $a$ in state $x$ results in state $x'$. It is natural to assume that the structure of the rewards $R^a_{xx'}$ is related to the transition structure $T^a_{xx'}$. Indeed, this is not restrictive, since one can always consider a DBN with the union of transition and reward dependencies. Usually it is assumed that the "global" reward is a sum of "local" rewards $R^{ia}_{u^i x'^i}$, one for each feature $i$ [KP99]. For simplicity of exposition I assume that the local reward $R^i$ only depends on the feature value $x'^i$ and not on $u^i$ and $a$. Even this is not restrictive and actually may be advantageous as discussed in [Hut09] for MDPs. So I assume

$$R^a_{xx'} = \sum_{i=1}^m R^i_{x'^i} =: R(x')$$

For instance, in the example of Section 3, two local rewards ($R^A_{x'^A} = 1_{x'^A<3}$ and $R^B_{x'^B} = 1_{x'^B<3}$) depend on $x'$ only, but the third reward depends on the action ($R^R = -1_{a \neq N}$).

Often it is assumed that the local rewards are directly observed or known [KP99], but we neither want nor can do this here: Having to specify many local rewards is an extra burden for the environment (e.g. the teacher), which preferably should be avoided. In our case, it is not even possible to pre-specify a local reward for each feature, since the features $\Phi^i$ themselves are learned by the agent and are not statically available. They are agent-internal and not part of the ΦDBN interface. In case multiple rewards are available, they can be modeled as part of the regular observations $o$, and $r$ only holds the overall reward. The agent must and can learn to interpret and exploit the local rewards in $o$ by himself.

Learning the reward function. In analogy to the MDP case for $R$ and the DBN case for $T$ above it is tempting to estimate $R^i_{x^i}$ by $\sum_{r'} r'\, n^{ir'}_{+x^i} / n^{i+}_{+x^i}$, i.e. by the average reward observed while feature $i$ has value $x^i$, but this makes no sense. For instance if $r_t = 1$ $\forall t$, then $\hat R^i_{x^i} \equiv 1$, and $\hat R^a_{xx'} \equiv m$ is a gross mis-estimation of $r_t = 1$. The localization of the global reward is somewhat more complicated. The goal is to choose $R^1_{x^1},...,R^m_{x^m}$ such that $r_t = R(x_t)$ $\forall t$.

Without loss we can set $R^i_0 \equiv 0$, since we can subtract a constant from each local reward and absorb them into an overall constant $w_0$. This allows us to write

$$R(x) = w_0 x^0 + w_1 x^1 + ... + w_m x^m = w^\top x$$

where $w_i := R^i_1$ and $x^0 :\equiv 1$.
In practice, the ΦDBN model will not be perfect, and an approximate solution, e.g. a least squares fit, is the best we can achieve. The square loss can be written as

$$Loss(w) := \sum_{t=1}^n (R(x_t) - r_t)^2 = w^\top A w - 2 b^\top w + c \qquad (4)$$
$$A_{ij} := \sum_{t=1}^n x^i_t x^j_t, \qquad b_i := \sum_{t=1}^n r_t x^i_t, \qquad c := \sum_{t=1}^n r_t^2$$

Note that $A_{ij}$ counts the number of times features $i$ and $j$ are "on" (=1) simultaneously, and $b_i$ sums all rewards for which feature $i$ is on. The loss is minimized for

$$\hat w := \arg\min_w Loss(w) = A^{-1} b, \qquad \hat R(x) = \hat w^\top x$$

which involves an inversion of the $(m+1)\times(m+1)$ matrix $A$. For singular $A$ we take the pseudo-inverse.
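
A minimal NumPy sketch of this least squares fit is given below. It assumes NumPy is available; the function name `fit_local_rewards` and the toy data are hypothetical. The constant feature $x^0 = 1$ is prepended explicitly, and `np.linalg.pinv` covers the singular case.

```python
import numpy as np

def fit_local_rewards(X, r):
    """Least squares weights w = A^+ b and the resulting Loss(w) = w'Aw - 2b'w + c."""
    X = np.asarray(X, dtype=float)
    r = np.asarray(r, dtype=float)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the constant feature x^0 = 1
    A = Xb.T @ Xb                                   # A_ij = sum_t x^i_t x^j_t
    b = Xb.T @ r                                    # b_i  = sum_t r_t x^i_t
    c = float(r @ r)                                # c    = sum_t r_t^2
    w = np.linalg.pinv(A) @ b                       # pseudo-inverse also handles singular A
    loss = float(w @ A @ w - 2 * b @ w + c)
    return w, loss

# Toy data: rewards generated exactly by r = 1 + 2*x^1 - x^2
X = [[0, 0], [1, 0], [0, 1], [1, 1]]
r = [1, 3, 0, 2]
w, loss = fit_local_rewards(X, r)
```

On this toy data the fit recovers the generating weights (1, 2, -1) with zero loss up to rounding.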

Reward coding. The quadratic loss function suggests a Gaussian model for the rewards:

$$P(r_{1:n}|\hat w,\sigma) := \exp(-Loss(\hat w)/2\sigma^2) / (2\pi\sigma^2)^{n/2}$$

Maximizing this w.r.t. the variance $\sigma^2$ yields the maximum likelihood estimate

$$-\log P(r_{1:n}|\hat w,\hat\sigma) = \tfrac{n}{2}\log(Loss(\hat w)) - \tfrac{n}{2}\log\tfrac{n}{2\pi e}$$

where $\hat\sigma^2 = Loss(\hat w)/n$. Given $\hat w$ and $\hat\sigma$ this can be regarded as the (Shannon-Fano) code length of $r_{1:n}$ (there are actually a few subtleties here which I gloss over). Each weight $\hat w_k$ and $\hat\sigma$ need also be coded to accuracy $O(1/\sqrt{n})$, which needs $(m+2)\frac12\log n$ bits total. Together this gives a complete code of length

$$CL(r_{1:n}|x_{1:n}a_{1:n}) = \tfrac{n}{2}\log(Loss(\hat w)) + \tfrac{m+2}{2}\log n - \tfrac{n}{2}\log\tfrac{n}{2\pi e} \qquad (5)$$
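
For concreteness, (5) can be evaluated as in the following sketch. The function name `cl_rewards` is my own, and the formula requires Loss > 0; a perfect fit would need separate treatment, which is not attempted here.

```python
from math import log2, pi, e

def cl_rewards(loss, n, m):
    """Reward code length (5) in bits for residual loss > 0, n time steps and m features."""
    if loss <= 0:
        raise ValueError("Eq. (5) requires a strictly positive loss")
    return n / 2 * log2(loss) + (m + 2) / 2 * log2(n) - n / 2 * log2(n / (2 * pi * e))
```

The first and third terms cancel when the loss equals $n/(2\pi e)$, leaving only the parameter cost $\frac{m+2}{2}\log n$; smaller losses make the total shorter.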

ΦDBN evaluation and selection is similar to the MDP case. Let $G$ denote the graphical structure of the DBN, i.e. the set of parents $Pa^i \subseteq \{1,...,m\}$ of each feature $i$. (Remember $u^i$ are the parent values.) Similarly to the MDP case, the cost of $(\Phi,G)$ on $h_n$ is defined as

$$Cost(\Phi,G|h_n) := CL(x_{1:n}|a_{1:n}) + CL(r_{1:n}|x_{1:n},a_{1:n}) \qquad (6)$$

and the best $(\Phi,G)$ minimizes this cost.

$$(\Phi^{best},G^{best}) := \arg\min_{\Phi,G}\{Cost(\Phi,G|h_n)\}$$

A general discussion why this is a good criterion can be found in [Hut09]. In the following section I mainly highlight the difference to the MDP case, in particular the additional dependence on and optimization over $G$.



5 DBN Structure Learning & Updating 

This section briefly discusses minimization of (6) w.r.t. $G$ given Φ and even briefer minimization w.r.t. Φ. For the moment regard Φ as given and fixed.

Cost and DBN structure. For general structured local rewards $R^{ia}_{u^i x'^i}$, (3) and (5) both depend on $G$, and (6) represents a novel DBN structure learning criterion that includes the rewards.

For our simple reward model $R^i_{x^i}$, (5) is independent of $G$, hence only (3) needs to be considered. This is a standard MDL criterion, but I have not seen it used in DBNs before. Further, the features $i$ are independent in the sense that we can search for the optimal parent sets $Pa^i \subseteq \{1,...,m\}$ for each feature $i$ separately.

Complexity of structure search. Even in this case, finding the optimal DBN structure is generally hard. In principle we could rely on off-the-shelf heuristic search methods for finding good $G$, but it is probably better to use or develop some special purpose optimizer. One may even restrict the space of considered graphs $G$ to those for which (3) can be minimized w.r.t. $G$ efficiently, as long as this restriction can be compensated by "smarter" Φ.

A brute force exhaustive search algorithm for $Pa^i$ is to consider all $2^m$ subsets of $\{1,...,m\}$ and select the one that minimizes $\sum_{u^i,a} CL(n^{ia}_{u^i\bullet})$. A reasonable and often employed assumption is to limit the number of parents to some small value $p$, which reduces the search space size to $O(m^p)$.

Indeed, since the Cost is exponential in the maximal number of parents of a feature, but only linear in $n$, a Cost minimizing Φ can usually not have more than a logarithmic number of parents, which leads to a search space that is pseudo-polynomial in $m$.
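
The exhaustive search with a bound $p$ on the number of parents can be sketched as follows. All names (`best_parents`, `feature_cost`, `code_length`), the trajectory encoding, and the toy data are assumptions for illustration; ties are broken in favour of the smaller parent set that is enumerated first.

```python
from collections import defaultdict
from itertools import combinations
from math import log2

def code_length(counts):
    """CL(n) from Eq. (2), in bits, up to an additive constant."""
    n = sum(counts)
    if n == 0:
        return 0.0
    nonzero = [c for c in counts if c > 0]
    entropy = -sum(c / n * log2(c / n) for c in nonzero)
    return n * entropy + (len(nonzero) - 1) / 2 * log2(n)

def feature_cost(xs, actions, i, pa):
    """Contribution of binary feature i to Eq. (3) when its parent set is pa."""
    counts = defaultdict(lambda: [0, 0])
    for t in range(1, len(xs)):
        u = tuple(xs[t - 1][j] for j in pa)
        counts[(actions[t - 1], u)][xs[t][i]] += 1
    return sum(code_length(c) for c in counts.values())

def best_parents(xs, actions, i, max_parents):
    """Try all parent sets with at most max_parents elements and keep the cheapest."""
    m = len(xs[0])
    candidates = [pa for k in range(max_parents + 1)
                  for pa in combinations(range(m), k)]
    return min(candidates, key=lambda pa: feature_cost(xs, actions, i, pa))

# Toy data: feature 1 at time t copies feature 0 at time t-1
x0 = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
xs = [(x0[t], x0[t - 1] if t > 0 else 0) for t in range(len(x0))]
actions = ["N"] * (len(xs) - 1)
```

On the toy data, `best_parents(xs, actions, 1, max_parents=1)` selects feature 0 as the only parent of feature 1, since that choice makes the transitions of feature 1 deterministic.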

Heuristic structure search. We could also replace the well-founded criterion (3) by some heuristic. One such heuristic has been developed in [SDL07]. The mutual information is another popular criterion for determining the dependency of two random variables, so we could add $j$ as a parent of feature $i$ if the mutual information of $x^j$ and $x'^i$ is above a certain threshold. Overall this takes time $O(nm^2)$ to determine $G$. An MDL inspired threshold for binary random variables is $\frac{1}{2n}\log n$. Since the mutual information treats parents independently, $\hat T$ has to be estimated accordingly, essentially as in naive Bayes classification [Lew98] with feature selection, where $x'^i$ represents the class label and $u^i$ are the features selected from $x$. The improved Tree-Augmented naive Bayes (TAN) classifier [FGG97] could be used to model synchronous feature dependencies (i.e. within a time slice). The Chow-Liu [CL68] minimum spanning tree algorithm allows determining $G$ in time $O(m^3)$. A tree becomes a forest if we employ a lower threshold for the mutual information.
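
The mutual information heuristic might look as follows in Python. This is a simplified sketch: it ignores the dependence on actions, the names `mutual_information` and `mi_parents` are hypothetical, and the default threshold $\log n/(2n)$ is my reading of the MDL inspired threshold for binary variables.

```python
from collections import Counter
from math import log2

def mutual_information(pairs):
    """Empirical mutual information (in bits) of a list of (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(c / n * log2(c * n / (left[a] * right[b]))
               for (a, b), c in joint.items())

def mi_parents(xs, i, threshold=None):
    """Features j whose value at time t-1 is informative about feature i at time t."""
    n = len(xs) - 1                                  # number of observed transitions
    if threshold is None:
        threshold = log2(n) / (2 * n)                # assumed MDL-style threshold
    m = len(xs[0])
    return tuple(
        j for j in range(m)
        if mutual_information([(xs[t - 1][j], xs[t][i]) for t in range(1, n + 1)]) > threshold
    )

# Toy data: feature 1 at time t copies feature 0 at time t-1
x0 = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
xs = [(x0[t], x0[t - 1] if t > 0 else 0) for t in range(len(x0))]
```

Each of the $m^2$ feature pairs needs one pass over the $n$ transitions, which matches the overall effort stated above.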

Φ search is even harder than structure search, and remains an art. Nevertheless the reduction of the complex (ill-defined) reinforcement learning problem to an internal feature search problem with well-defined objective is a clear conceptual advance.

In principle (but not in practice) we could consider the set of all (computable) functions $\{\Phi : \mathcal{H} \to \{0,1\}\}$. We then compute $Cost(\Phi|h)$ for every finite subset $\Phi = \{\Phi^{i_1},...,\Phi^{i_m}\}$ and take the minimum (note that the order is irrelevant).

Most practical search algorithms require the specification of some neighborhood function, here for Φ. For instance, stochastic search algorithms suggest and accept a neighbor of Φ with a probability that depends on the Cost reduction. See [Hut09] for more details. Here I will only present some very simplistic ideas for features and neighborhoods.

Assume binary observations $\mathcal{O} = \{0,1\}$ and consider the last $m$ observations as features, i.e. $\Phi^i(h_n) = o_{n-i+1}$ and $\Phi(h_n) = (\Phi^1(h_n),...,\Phi^m(h_n)) = o_{n-m+1:n}$. So the states are the same as for $\Phi_m$MDP in [Hut09], but now $\mathcal{S} = \{0,1\}^m$ is structured as $m$ binary features. In the example here, $m = 5$ leads to a perfect ΦDBN. We can add a new feature $o_{n-m}$ ($m \leadsto m+1$) or remove the last feature ($m \leadsto m-1$), which defines a natural neighborhood structure.
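
This simplistic feature map and its neighborhood can be written down in a few lines. The sketch below uses hypothetical names, returns the window in chronological order, and pads short histories with zeros, which is an assumption not made in the text.

```python
def phi_last_m(observations, m):
    """Feature vector o_{n-m+1:n}: the last m binary observations, zero-padded on the left."""
    padded = [0] * max(0, m - len(observations)) + list(observations)
    return tuple(padded[len(padded) - m:])

def neighbors(m):
    """Neighboring window lengths: add one older observation or drop the oldest one."""
    return [m + 1] + ([m - 1] if m > 0 else [])
```

For instance, `phi_last_m([1, 0, 1, 1, 0], 3)` gives `(1, 1, 0)`, and a local search over Φ would compare the Cost at `m` with the Cost at each element of `neighbors(m)`.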

Note that the context trees of [McC96, Hut09] are more flexible. To achieve this flexibility here we either have to use smarter features within our framework (simply interpret $s = \Phi_{\mathcal{S}}(h)$ as a feature vector of length $m = \lceil\log|\mathcal{S}|\rceil$) or use smarter (non-tabular) estimates of $P^a(x'^i|u^i)$ extending our framework (to tree dependencies).

For general purpose intelligent agents we clearly need more powerful features. Logical expressions or (non)accepting Turing machines or recursive sets can map histories or parts thereof into true/false or accept/reject or in/out, respectively, hence naturally represent binary features. Randomly generating such expressions or programs with an appropriate bias towards simple ones is a universal feature generator that eventually finds the optimal feature map. The idea is known as Universal Search [Gag07].

6 Value & Policy Learning in ΦDBN

Given an estimate $\hat\Phi$ of $\Phi^{best}$, the next step is to determine a good action for our agent. I mainly concentrate on the difficulties one faces in adapting MDP algorithms and discuss state of the art DBN algorithms. Value and policy learning in known finite state MDPs is easy provided one is satisfied with a polynomial time algorithm. Since a DBN is just a special (structured) MDP, its (Q) Value function respects the same Bellman equations [Hut09, Eq.(6)], and the optimal policy is still given by $a_{n+1} := \arg\max_a Q^{*a}_{x_{n+1}}$. Nevertheless, their solution is now a nightmare, since the state space is exponential in the number of features. We need algorithms that are polynomial in the number of features, i.e. logarithmic in the number of states.

Value function approximation. The first problem is that the optimal value and policy do not respect the structure of the DBN. They are usually complex functions of the (exponentially many) states, which cannot even be stored, not to mention computed [KP99]. It has been suggested that the value can often be approximated well as a sum of local values similarly to the rewards. Such a value function can at least be stored.

Model-based learning. The default quality measure for the approximate value is the ρ-weighted squared difference, where ρ is the stationary distribution.

Even for a fixed policy, value iteration does not converge to the best approximation, but usually converges to a fixed point close to it [BT96]. Value iteration requires ρ explicitly. Since ρ is also too large to store, one has to approximate ρ as well. Another problem, as pointed out in [KP00], is that policy iteration may not converge, since different policies have different (misleading) stationary distributions. Koller and Parr [KP00] devised algorithms for general factored ρ, and Guestrin et al. [GKPV03] for max-norm, alleviating this problem. Finally, general policies cannot be stored exactly, and another restriction or approximation is necessary.

Model-free learning. Given the difficulties above, I suggest to (re)consider a very simple class of algorithms, without suggesting that it is better. The above model-based algorithms exploit $\hat T$ and $\hat R$ directly. An alternative is to sample from $\hat T$ and use model-free "Temporal Difference (TD)" learning algorithms based only on this internal virtual sample [SB98]. We could use TD(λ) or Q-value variants with linear value function approximation.

Besides their simplicity, another advantage is that neither the stationary distribution nor the policy needs to be stored or approximated. Once approximation $\hat Q^*$ has been obtained, it is trivial to determine the optimal (w.r.t. $\hat Q^*$) action via $a_{n+1} = \arg\max_a \hat Q^{*a}_{x_{n+1}}$ for any state of interest (namely $x_{n+1}$) exactly.
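
To make the idea concrete, here is a minimal sketch of TD(0), the λ=0 special case, with a value function that is linear in the binary features, trained on samples drawn from an internal model. It only evaluates a fixed policy rather than learning Q-values; the names `td0_linear`, `sample_model`, `policy`, and the toy model are hypothetical.

```python
def features(x):
    """Feature vector (1, x^1, ..., x^m) with a constant bias component."""
    return [1.0] + [float(v) for v in x]

def value(w, x):
    """Linear value estimate V(x) = w . (1, x)."""
    return sum(wi * fi for wi, fi in zip(w, features(x)))

def td0_linear(sample_model, policy, x0, steps, gamma=0.5, alpha=0.1):
    """TD(0) policy evaluation on virtual samples from the (estimated) model."""
    w = [0.0] * (len(x0) + 1)
    x = x0
    for _ in range(steps):
        a = policy(x)
        x_next, r = sample_model(x, a)               # virtual transition drawn from the model
        td_error = r + gamma * value(w, x_next) - value(w, x)
        w = [wi + alpha * td_error * fi for wi, fi in zip(w, features(x))]
        x = x_next
    return w

# Toy model: one binary feature; "flip" toggles it; the reward equals the new feature value
def toy_model(x, a):
    x_next = (1 - x[0],) if a == "flip" else x
    return x_next, float(x_next[0])

w = td0_linear(toy_model, lambda x: "flip", (0,), steps=2000)
```

In the toy model the true values under the always-flip policy are V(0) = 4/3 and V(1) = 2/3 for γ = 0.5, and the learned weights reproduce them; only the m+1 weights are stored, never a table over all $2^m$ states.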

Exploration. Optimal actions based on approximate rather than exact values can lead to very poor behavior due to lack of exploration. There are polynomially optimal algorithms (Rmax, E3, OIM) for the exploration-exploitation dilemma.

For model-based learning, extending E3 to DBNs is straightforward, but E3 needs an oracle for planning in a given DBN [KK99]. Recently, Strehl et al. [SDL07] accomplished the same for Rmax. They even learn the DBN structure, albeit in a very simplistic way. Algorithm OIM [SL08], which I described in [Hut09] for MDPs, can also likely be generalized to DBNs, and I can imagine a model-free version.

7 Incremental Updates 

As discussed in Section 5, most search algorithms are local in the sense that they produce a chain of "slightly" modified candidate solutions, here Φ's. This suggests a potential speedup by computing quantities of interest incrementally.

Cost. Computing $CL(x|a)$ in (3) takes at most time $O(m 2^k |\mathcal{A}|)$, where $k$ is the maximal number of parents of a feature. If we remove feature $i$, we can simply remove/subtract the contributions from $i$ in the sum. If we add a new feature $m+1$, we only need to search for the best parent set $u^{m+1}$ for this new feature, and add the corresponding code length. In practice, many transitions don't occur, i.e. $n^{ia}_{u^i x^i} = 0$, so $CL(x|a)$ can actually be computed much faster in time $O(|\{n^{ia}_{u^i x^i} > 0\}|)$, and incrementally even faster.

Rewards. When adding a new feature, the current local reward estimates may not change much. If we reassign a fraction $\alpha \le 1$ of reward to the new feature $x^{m+1}$, we get the following ansatz^1:

$$\hat R(x^1,...,x^{m+1}) = (1-\alpha)\hat R(x) + w_{m+1}x^{m+1} =: v^\top\psi(x)$$
$$v := (1-\alpha, w_{m+1})^\top, \qquad \psi := (\hat R(x), x^{m+1})^\top$$

Minimizing $\sum_{t=1}^n (\hat R(x^1_t...x^{m+1}_t) - r_t)^2$ w.r.t. $v$ analogous to (4) just requires a trivial $2\times 2$ matrix inversion. The minimum $\hat v$ results in an initial new estimate $\hat w = ((1-\hat\alpha)\hat w_0,...,(1-\hat\alpha)\hat w_m,\hat w_{m+1})^\top$, which can be improved by some first order gradient descent algorithm in time $O(m)$, compared to the exact $O(m^3)$ algorithm. When removing a feature, we simply redistribute its local reward to the other features, e.g. uniformly, followed by improvement steps that cost $O(m)$ time.
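
The 2×2 fit for a newly added feature can be sketched as follows. The function name `fit_new_feature` and the toy data are hypothetical; `R_old[t]` stands for the old estimate $\hat R(x_t)$ and `x_new[t]` for the value of the new feature at time $t$.

```python
def fit_new_feature(R_old, x_new, r):
    """Least squares fit of v = (1 - alpha, w_new) in  v1 * R_old + v2 * x_new ~ r."""
    a11 = sum(p * p for p in R_old)
    a12 = sum(p * q for p, q in zip(R_old, x_new))
    a22 = sum(q * q for q in x_new)
    b1 = sum(p * y for p, y in zip(R_old, r))
    b2 = sum(q * y for q, y in zip(x_new, r))
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("singular 2x2 system; the two regressors are collinear")
    v1 = (a22 * b1 - a12 * b2) / det      # estimate of 1 - alpha
    v2 = (a11 * b2 - a12 * b1) / det      # estimate of the new weight w_{m+1}
    return v1, v2

# Toy data generated by r = 0.5 * R_old + 3 * x_new
R_old = [1.0, 1.0, 2.0, 2.0]
x_new = [0, 1, 0, 1]
r = [0.5, 3.5, 1.0, 4.0]
v1, v2 = fit_new_feature(R_old, x_new, r)
```

On the toy data the fit returns (0.5, 3), i.e. half of the old reward estimate is kept and the new feature receives weight 3; the old weights are then simply rescaled by the first component.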

Value. All iteration algorithms described in Section 6 for computing (Q) Values need an initial value for $V$ or $Q$. We can take the estimate $\hat V$ from a previous Φ as an initial value for the new Φ. Similarly as for the rewards, we can redistribute a fraction of the values by solving relatively small systems of equations. The result is then used as an initial value for the iteration algorithms in Section 6. A further speedup can be obtained by using prioritized iteration algorithms that concentrate their time on badly estimated parameters, which are in our case the new values [SB98].

Similarly, results from time $t$ can be (re)used as initial estimates for the next cycle $t+1$, followed by a fast improvement step.

8 Outlook 

ΦDBN leaves many more questions open and more room for modifications and improvements than ΦMDP. Here are a few.

• The cost function can be improved by integrating out the states analogous to the ΦMDP case [Hut09]: The likelihood $P(r_{1:n}|a_{1:n},\hat U)$ is unchanged, except that $\hat U \equiv \hat T\hat R$ is now estimated locally, and the complexity penalty becomes $\frac12(M+m+2)\log n$, where $M$ is (essentially) the number of non-zero counts $n^{ia}_{u^i x^i}$, but an efficient algorithm has yet to be found.

^1 An Ansatz is an initial mathematical or physical model with some free parameters to be determined subsequently. [http://en.wikipedia.org/wiki/Ansatz]



• It may be necessary to impose and exploit structure on the conditional probability tables $P^a(x'^i|u^i)$ themselves [BDH99].

• Real-valued observations and beliefs suggest to extend the binary feature model to [0,1] interval valued features rather than coding them binary. Since any continuous semantics that preserves the role of 0 and 1 is acceptable, there should be an efficient way to generalize Cost and Value estimation procedures.

• I assumed that the reward/value is linear in local rewards/values. Is this sufficient for all practical purposes? I also assumed a least squares and Gaussian model for the local rewards. There are efficient algorithms for much more flexible models. The least we could do is to code w.r.t. the proper covariance $A$.

• I also barely discussed synchronous (within time-slice) dependencies.

• I guess ΦDBN will often be able to work around too restrictive DBN models, by finding features Φ that are more compatible with the DBN and reward structure.

• Extra edges in the DBN can improve the linear value function approximation. To give ΦDBN incentives to do so, the Value would have to be included in the Cost criterion.

• Implicitly I assumed that the action space $\mathcal{A}$ is small. It is possible to extend ΦDBN to large structured action spaces.

• Apart from the Φ-search, all parts of ΦDBN seem to be poly-time approximable, which is satisfactory in theory. In practice, this needs to be improved to essentially linear time in $n$ and $m$.

• Developing smart generation and smart stochastic search algorithms for Φ are the major open challenges.

• A more Bayesian Cost criterion would be desirable: a likelihood of $h$ given Φ and a prior over Φ leading to a posterior of Φ given $h$, or so. Monte Carlo (search) algorithms like Metropolis-Hastings could sample from such a posterior. Currently probabilities ($= 2^{-CL}$) are assigned only to rewards and states, but not to observations and feature maps.

Summary. In this work I introduced a powerful framework (ΦDBN) for general-purpose intelligent learning agents, and presented algorithms for all required building blocks. The introduced cost criterion reduced the informal reinforcement learning problem to an internal well-defined search for "relevant" features.